Frameworks for inference

  1. Analytical
  2. Maximum likelihood
  3. Resampling
  4. Bayesian

Flying snake

Inferring a mean

Mean undulation rate for \(n = 8\) gliding snakes:

undulation_rate <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

What is the mean undulation rate for this sample of flying snakes?

Undulation rate

Analytical inference of mean

Arithmetic mean:

\[\bar{Y} = \frac{\sum_{i=1}^{n}Y_i}{n}\]

\[mean~undulation~rate = \frac{\sum_{i=1}^{n}undulation~rate_i}{n}\]

Analytical inference of mean

sum(undulation_rate) / length(undulation_rate)
## [1] 1.375
mean(undulation_rate)
## [1] 1.375

Maximum likelihood inference of mean

Use dnorm() to calculate the relative likelihood of an observed value \(Y_i\) drawn from a normal distribution given a mean (\(\mu\)) and standard deviation (\(\sigma\)).

\[f\left(Y_i; \mu, \sigma\right) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{\frac{-\left(Y_i-\mu\right)^{2}}{2\sigma^{2}}}\]

Standard normal distribution

dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423
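As a sanity check (not part of the original slides), the value from dnorm() can be reproduced by evaluating the density equation above by hand:

```r
# Normal density at Y = 0 with mu = 0 and sigma = 1, evaluated by hand
mu <- 0; sigma <- 1; y <- 0
(1 / sqrt(2 * pi * sigma^2)) * exp(-(y - mu)^2 / (2 * sigma^2))
## [1] 0.3989423
```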

Calculating a likelihood

Hypothesizing that the population mean is 0 and the standard deviation is 1, what is the likelihood of the observed values?

  1. This defines a model.
  2. Calculate the relative likelihood of each observation.
  3. The model likelihood is the product of the individual likelihoods.
  4. The log-likelihood is more tractable, so calculate that instead.

Model Likelihood (\(\mathcal{L}\))

For a set of observations (\(Y_i\)) and hypothesized parameters (\(\Theta\); i.e., mean and standard deviation) the model likelihood is the product of the observations’ individual likelihoods:

\[\mathcal{L}\left(\left\{ Y_{i}\right\} _{i=1}^{n};\Theta\right) = \prod_{i=1}^{n}\phi\left(Y_{i}; \Theta\right)\]

\[\log\left(\mathcal{L}\left(\left\{ Y_{i}\right\} _{i=1}^{n};\Theta\right)\right) = \sum_{i=1}^{n} \log\left(\phi\left(Y_{i};\Theta\right)\right)\]

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

Hypothesizing that the population mean is 0 and the standard deviation is 1, what is the likelihood of the observed values?

Likelihood for the first observation (undulation_rate[1]):

undulation_rate[1]
## [1] 0.9
dnorm(undulation_rate[1], mean = 0, sd = 1)
## [1] 0.2660852

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

This is only the likelihood for one observation. We need the likelihoods for all 8 undulation rates to get a model likelihood.

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

Vector of likelihoods for all values in undulation_rate given mu = 0 and sigma = 1:

(rel_liks <- dnorm(undulation_rate, mean = 0, sd = 1))
## [1] 0.26608525 0.19418605 0.19418605 0.17136859 0.14972747 0.14972747 0.11092083
## [8] 0.05399097

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

Model likelihood is the product of those likelihoods:

(lik <- prod(rel_liks))
## [1] 2.308476e-07

Likelihood to log-likelihood

log(lik)
## [1] -15.28151

Rather than logging the product, we can sum the log-likelihoods:

sum(log(rel_liks))
## [1] -15.28151

For a model in which the mean is 0 and standard deviation is 1, the model log-likelihood is -15.28.

Higher likelihood

Is there another combination of \(\mu\) and \(\sigma\) that gives a higher likelihood (= larger log-likelihood)?

Try \(\mu = 1\) and \(\sigma = 1\):

sum(log(dnorm(undulation_rate, mean = 1, sd = 1)))
## [1] -8.281508

This is an improvement over \(\mu = 0\) and \(\sigma = 1\).

Calculating the log-likelihood for a range of \(\mu\) and \(\sigma\)

Find the combination of \(\mu\) and \(\sigma\) that maximizes the log-likelihood of the model for the mean and standard deviation of undulation rates.

Ranges of possible values:

  1. Mean (\(\mu\)): \(-\infty < \mu < \infty\)
  2. Standard deviation (\(\sigma\)): \(0 < \sigma < \infty\)

Grid approximation

For combinations of \(\mu\) and \(\sigma\), calculate the model log-likelihood. Pick the combination with the largest log-likelihood as the parameter estimates.

Set up the grid:

library(tidyr)                     # crossing() comes from tidyr

n <- 100                           # How fine is the grid?
mu <- seq(0.1, 3, length = n)      # Vector of mu
sigma <- seq(0.1, 0.5, length = n) # Vector of sigma

grid_approx <- crossing(mu, sigma)

grid_approx
## # A tibble: 10,000 x 2
##       mu sigma
##    <dbl> <dbl>
##  1   0.1 0.1  
##  2   0.1 0.104
##  3   0.1 0.108
##  4   0.1 0.112
##  5   0.1 0.116
##  6   0.1 0.120
##  7   0.1 0.124
##  8   0.1 0.128
##  9   0.1 0.132
## 10   0.1 0.136
## # … with 9,990 more rows
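If tidyr is not available, base R's expand.grid() builds the same 10,000 combinations (as a data.frame, with the first column varying fastest, so the row order differs from crossing()). The name grid_base here is illustrative:

```r
# Base-R equivalent of tidyr::crossing(): all mu x sigma combinations
grid_base <- expand.grid(mu = mu, sigma = sigma)
nrow(grid_base)
## [1] 10000
```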

Grid approximation

log_lik <- numeric(length = nrow(grid_approx))

for (ii in 1:nrow(grid_approx)) {
  log_lik[ii] <- 
    sum(dnorm(undulation_rate,
              mean = grid_approx$mu[ii],
              sd = grid_approx$sigma[ii],
              log = TRUE))
}
grid_approx$log_lik <- log_lik
  • Iterate through the rows (\(ii\)) of grid_approx
  • For each row, assign the model log-likelihood calculated for that mu and sigma to log_lik

grid_approx
## # A tibble: 10,000 x 3
##       mu sigma log_lik
##    <dbl> <dbl>   <dbl>
##  1   0.1 0.1     -676.
##  2   0.1 0.104   -624.
##  3   0.1 0.108   -578.
##  4   0.1 0.112   -536.
##  5   0.1 0.116   -499.
##  6   0.1 0.120   -466.
##  7   0.1 0.124   -436.
##  8   0.1 0.128   -408.
##  9   0.1 0.132   -384.
## 10   0.1 0.136   -361.
## # … with 9,990 more rows
  • For a 100 × 100 grid, there are 10,000 calculations.
  • If there were 3 parameters, there would be 1,000,000.
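As an alternative to the explicit loop (a base-R sketch, not from the slides; the helper name log_lik_fun is hypothetical), the same column can be built with mapply() over the two grid columns:

```r
# Same log-likelihoods without explicit indexing
log_lik_fun <- function(mu, sigma) {
  sum(dnorm(undulation_rate, mean = mu, sd = sigma, log = TRUE))
}
grid_approx$log_lik <- mapply(log_lik_fun, grid_approx$mu, grid_approx$sigma)
```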

Visualizing likelihood surface

Grid approximation

On this grid, the maximum likelihood estimates of \(\mu\) and \(\sigma\) are:

grid_approx[which.max(grid_approx$log_lik), ]
##            mu     sigma   log_lik
## 4451 1.388889 0.3020202 -1.810766

The analytical estimates are:

mean(undulation_rate)
## [1] 1.375
sd(undulation_rate)
## [1] 0.324037
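The grid estimate of \(\sigma\) (0.302) sits noticeably below sd()'s 0.324, and that is not just grid coarseness: the maximum likelihood estimator of \(\sigma\) divides the sum of squared deviations by \(n\), whereas sd() divides by \(n - 1\). Checking directly:

```r
# ML estimator of sigma divides by n = 8; sd() divides by n - 1
sqrt(sum((undulation_rate - mean(undulation_rate))^2) / 8)
## [1] 0.3031089
sd(undulation_rate)
## [1] 0.324037
```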

Maximum likelihood via optimization

Search for the most likely values of \(\mu\) and \(\sigma\) across all possible values.

Maximum likelihood via optimization

Define a function that takes a vector x of values to optimize (\(\mu\) and \(\sigma\)) as well as a set of data Y, and returns the log-likelihood:

log_lik <- function(x, Y){
  liks <- dnorm(Y, mean = x[1], sd = x[2], log = TRUE)
  return(sum(liks))
}

We can now simultaneously optimize \(\mu\) and \(\sigma\), maximizing the log-likelihood.
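A quick call to the function just defined (an added check, not in the original slides) reproduces the model log-likelihood computed earlier for \(\mu = 0\) and \(\sigma = 1\):

```r
log_lik(c(0, 1), Y = undulation_rate)
## [1] -15.28151
```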

Maximum likelihood via optimization

fnscale = -1 tells optim() to maximize rather than minimize; reltol says to stop when the relative improvement is \(<10^{-100}\).

optim(c(0.1, 0.1), # Start at 0.1, 0.1
      log_lik,
      Y = undulation_rate,
      control = list(fnscale = -1,
                     reltol = 10^-100))
## $par
## [1] 1.3750000 0.3031089
## 
## $value
## [1] -1.802203
## 
## $counts
## function gradient 
##      175       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Maximum likelihood via optimization

glm() fits generalized linear models via optimization:

fm <- glm(undulation_rate ~ 1) # Estimate a mean only
coef(fm)
## (Intercept) 
##       1.375
logLik(fm)
## 'log Lik.' -1.802203 (df=2)

For a small enough tolerance, the maximum likelihood estimate equals the analytical estimate.

Resampling inference of the mean

set.seed(36428)
reps <- 100000
resampled_mean <- numeric(length = reps)

for (i in 1:reps) {
  resampled_mean[i] <- mean(sample(undulation_rate,
                                   replace = TRUE))
}
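As an aside, the loop can be written more compactly with replicate(); since the random-number calls happen in the same order, it should give the same draws under the same seed:

```r
# Compact equivalent of the resampling loop
set.seed(36428)
resampled_mean <- replicate(reps,
                            mean(sample(undulation_rate, replace = TRUE)))
```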

Resampling inference of the mean

mean(undulation_rate)
## [1] 1.375
mean(resampled_mean)
## [1] 1.375399

Given enough iterations, the resampled mean equals the analytical mean (and equals the maximum likelihood mean).
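The resampled means also give an interval estimate almost for free (an extension beyond the original slide): the 2.5% and 97.5% quantiles of resampled_mean form a percentile bootstrap confidence interval.

```r
# 95% percentile bootstrap CI for the mean undulation rate
quantile(resampled_mean, c(0.025, 0.975))
```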

Bayesian vs. ML inference

Maximum likelihood inference:

  • Probability of the data, given the parameter estimate
  • Parameters are fixed; data varies.
  • No prior possible

Bayesian inference:

  • Probability of the parameters, given the data
  • Data are fixed; parameters vary.
  • Prior required
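The two views are connected by Bayes' rule: the posterior (probability of the parameters given the data) is proportional to the likelihood (probability of the data given the parameters) times the prior:

\[Pr\left[\Theta \mid Y\right] = \frac{Pr\left[Y \mid \Theta\right] \, Pr\left[\Theta\right]}{Pr\left[Y\right]}\]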

Bayesian inference of the mean

Maximum likelihood searches the full ranges of possible parameter values:

  1. \(\mu\): \(-\infty < \mu < \infty\)
  2. \(\sigma\): \(0 < \sigma < \infty\)

Drawbacks:

  1. \(\mu\) can't be negative (there are no negative undulation rates) and probably isn't a large number.
  2. \(\sigma\) probably isn't huge either.

Can we do better? Yes: Bayesian priors.

Prior for the mean

Cauchy distribution (location = 0, scale = 3)
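The prior's shape can be sketched with dcauchy(). Because mu is constrained to be non-negative in the model below, the effective prior is a half-Cauchy, whose density on \([0, \infty)\) is twice the full Cauchy density (a rough base-R sketch, not from the slides):

```r
# Half-Cauchy(location = 0, scale = 3) prior density for mu
x <- seq(0, 10, length.out = 200)
plot(x, 2 * dcauchy(x, location = 0, scale = 3), type = "l",
     xlab = expression(mu), ylab = "Prior density")
```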

Bayesian model

stan code:

model <- "
  data{
    int<lower=1> N;
    real undulation_rate[N];
  }
  parameters{
    real<lower=0> mu;
    real<lower=0> sigma;
  }
  model{
    mu ~ cauchy(0, 3);
    sigma ~ exponential(1);
    undulation_rate ~ normal(mu, sigma);
  }
"

Sample the Bayesian model

fm_priors <- stan(
  model_code = model,
  data = list(undulation_rate = undulation_rate,
              N = length(undulation_rate)),
  iter = 1e5,
  chains = 4)

Inspecting the samples

Summarizing the results

## Inference for Stan model: bd7dbec9aa08d046d627dd6f83a1db7d.
## 4 chains, each with iter=1e+05; warmup=50000; thin=1; 
## post-warmup draws per chain=50000, total post-warmup draws=2e+05.
## 
##        mean se_mean    sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu    1.370   0.001 0.144  1.080 1.284 1.371 1.456 1.655 79446    1
## sigma 0.385   0.000 0.127  0.221 0.298 0.358 0.440 0.705 71227    1
## lp__  3.046   0.005 1.150 -0.045 2.599 3.399 3.866 4.167 53932    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Jan  4 14:41:41 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The posterior mean (1.370) is slightly lower than the analytical and ML estimate (1.375) because the prior places more probability on lower values.